Overview

This report evaluates the accuracy and precision of probabilistic forecasts of COVID-19 cases and deaths submitted to the US COVID-19 Forecast Hub. Some analyses include all forecasts submitted since April 2020; others focus on evaluating “recent” forecasts, submitted only in the last 10 weeks.

In collaboration with the US Centers for Disease Control and Prevention (CDC), the COVID-19 Forecast Hub collects short-term COVID-19 forecasts from dozens of research groups around the globe. Every Tuesday morning we combine the most recent forecasts from each team into a single “ensemble” forecast for each forecast target. This ensemble is used as the official forecast of the CDC, typically appearing on their forecasting website on Wednesday.

Incident Case Forecasts

Summary Tables

The first table evaluates models based on their adjusted relative weighted interval scores (WIS, a measure of distributional accuracy) and adjusted relative mean absolute error (MAE). Scores are aggregated separately for the most recent 10 weeks and for all historical weeks. To account for the varying difficulty of forecasting different weeks and locations, a pairwise approach was used to calculate the relative adjusted WIS and MAE. Models with relative scores lower than 1 have been more accurate than the baseline on average, whereas relative scores greater than 1 indicate less accuracy than the baseline on average.
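As a sketch of how WIS can be computed from a set of submitted quantiles (the function names and the single-interval example here are illustrative, not the Hub's actual code):

```python
import numpy as np

def interval_score(y, lower, upper, alpha):
    """Score for a central (1 - alpha) prediction interval: the interval
    width plus penalties proportional to how far the observation y falls
    outside the interval."""
    return ((upper - lower)
            + (2 / alpha) * np.maximum(lower - y, 0)
            + (2 / alpha) * np.maximum(y - upper, 0))

def weighted_interval_score(y, median, lowers, uppers, alphas):
    """WIS: a weighted average of the absolute error of the predictive
    median and the scores of K nested central prediction intervals."""
    K = len(alphas)
    total = 0.5 * abs(y - median)
    for lo, up, a in zip(lowers, uppers, alphas):
        total += (a / 2) * interval_score(y, lo, up, a)
    return total / (K + 0.5)

# Example: a single 50% interval (alpha = 0.5) that covers the observation.
wis = weighted_interval_score(y=10, median=10, lowers=[8], uppers=[12],
                              alphas=[0.5])
```

Lower WIS is better; a wide interval, or an observation falling outside the interval, both increase the score.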

The second table evaluates models based on their prediction interval coverage at the 50% and 95% levels. Scores are aggregated separately for the most recent 10 weeks and for all historical weeks.
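Empirical interval coverage is simply the fraction of observations that fall inside the stated interval; a minimal sketch (illustrative names):

```python
import numpy as np

def empirical_coverage(y, lower, upper):
    """Fraction of observations falling inside their prediction intervals.
    A well-calibrated 95% interval should score close to 0.95."""
    y, lower, upper = (np.asarray(a, dtype=float) for a in (y, lower, upper))
    return float(np.mean((y >= lower) & (y <= upper)))
```

Aggregating this quantity over the most recent 10 weeks and over all historical weeks yields the two sets of coverage columns.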

Inclusion criteria for each column are detailed below the table.

Accuracy Table

In order to calculate each column in our table, different inclusion criteria were applied.

  • The first column in the table lists all models that have contributed forecasts for 5 or more weeks total since the beginning of April, or models that have submitted forecasts during at least 2 of the last 3 evaluated weeks. These inclusion criteria were chosen both to score models that submitted for a substantial number of weeks at some point during the pandemic but may no longer be submitting, and to evaluate new teams that have recently joined our forecasting efforts.

  • The next column lists the number of forecasts a team has submitted with a target end date over the most recent 10 week period.

  • Columns 3 and 4 show the adjusted relative WIS and the adjusted relative MAE over the most recent 10 week period. For inclusion in these columns, a model must have submitted 50% or more of the evaluated forecasts in the most recent evaluation period.

  • Column 5 shows the number of historical forecasts a team has submitted. All teams that have submitted at least 5 forecasts, or forecasts in at least 2 of the last 3 weeks, are included in this count.

  • Columns 6 and 7 show the adjusted WIS and adjusted MAE over a historical period beginning the first week in March. For inclusion in these columns, a model must have predictions for 50% or more of the evaluated forecasts in the historical evaluation period.

Coverage Table

For inclusion in this table, a model must have contributed forecasts for 5 or more weeks total since the beginning of April, or have submitted forecasts during at least 2 of the last 3 evaluated weeks. These inclusion criteria were chosen both to score models that submitted for a substantial number of weeks at some point during the pandemic but may no longer be submitting, and to evaluate new teams that have recently joined our forecasting efforts.

Evaluation by Week

In the following figures, we evaluate models across multiple forecasting weeks. Only models that have submitted probabilistic forecasts for all 50 states are included in this comparison.

The first 2 figures use WIS as the metric. The first shows the mean WIS across all 50 states at a 1 week horizon, for submission weeks beginning the first week in April. The second shows the mean WIS aggregated across the same locations at a 4 week horizon.

To view a specific team, double click on the team name in the legend. To view a value on the plot, click on the point of interest. To view a specific time of interest, highlight that section of the graph or use the zoom functionality.

1 Week Horizon WIS

4 Week Horizon WIS

In this figure, the dotted black line represents the average 1 week ahead error. There is often larger error for the 4 week horizon compared to the 1 week horizon.

1 Week Horizon 95% Coverage

We would expect a well calibrated model to have a value of 95% in this plot.

4 Week Horizon 95% Coverage

We would expect a well calibrated model to have a value of 95% in this plot. There is typically larger error for the 4 week horizon compared to the 1 week horizon.

Truth data

This figure shows the number of incident cases reported each week. The period between the vertical lines shows the number of weeks for which models were evaluated.

Incident Death Forecasts

Summary Tables

The first table below evaluates models based on their adjusted relative weighted interval scores (WIS, a measure of distributional accuracy) and adjusted relative mean absolute error (MAE). Scores are aggregated separately for the most recent 10 weeks and for all historical weeks. To account for the varying difficulty of forecasting different weeks and locations, a pairwise approach was used to calculate the relative adjusted WIS and MAE. Models with relative scores lower than 1 have been more accurate than the baseline on average, whereas relative scores greater than 1 indicate less accuracy than the baseline on average.
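The pairwise relative scores can be sketched as follows, following the geometric-mean construction commonly used for this comparison (the function name and the dictionary-of-arrays layout are illustrative assumptions): for each pair of models, compare mean WIS on the targets both forecast, take the geometric mean of a model's pairwise ratios, and rescale so the baseline model's score equals 1.

```python
import numpy as np

def relative_scores(scores, baseline):
    """scores: {model: 1-D array of WIS values per target, NaN = missing}.
    Returns each model's relative skill scaled so baseline == 1; values
    below 1 indicate better-than-baseline accuracy on average."""
    models = list(scores)
    theta = {}
    for m in models:
        log_ratios = []
        for m2 in models:
            # Compare only on targets that both models actually forecast.
            shared = ~np.isnan(scores[m]) & ~np.isnan(scores[m2])
            if shared.any():
                log_ratios.append(np.log(scores[m][shared].mean()
                                         / scores[m2][shared].mean()))
        # Geometric mean of the pairwise mean-score ratios.
        theta[m] = float(np.exp(np.mean(log_ratios)))
    return {m: theta[m] / theta[baseline] for m in models}
```

Because every model is compared only on shared targets, the rescaled scores are less sensitive to which (easy or hard) weeks and locations a given team happened to forecast.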

The second table evaluates models based on their prediction interval coverage at the 50% and 95% levels. Scores are aggregated separately for the most recent 10 weeks and for all historical weeks.

Inclusion criteria for each column are detailed below the table.

Accuracy Table

In this table, we have included all models with an eligible WIS or MAE score.

To be eligible for the adjusted relative WIS or MAE over the most recent 10 week period, a model must have submitted forecasts for 50% or more of the evaluated forecasts in the most recent evaluation period. WIS was only calculated for teams that submitted all required quantiles.

In order to be eligible for the historical calculation of MAE or WIS, a model must have predictions for 50% or more of the evaluated forecasts in the historical evaluation period.

Coverage Table

For inclusion in this table, a model must have contributed forecasts for 5 or more weeks total since the beginning of April, or have submitted forecasts during at least 2 of the last 3 evaluated weeks. These inclusion criteria were chosen both to score models that submitted for a substantial number of weeks at some point during the pandemic but may no longer be submitting, and to evaluate new teams that have recently joined our forecasting efforts.

Evaluation by Week

In the following figures, we have evaluated models across multiple forecasting weeks. The models included in this comparison must have submitted forecasts for all 50 states and at a national level for each timepoint.

For the first 2 figures, WIS is used as a metric. The first figure shows the mean WIS across all locations for each submission week at a 1 week horizon. The second figure shows the mean WIS aggregated across locations for a 4 week horizon.

To view a specific team, double click on the team name in the legend. To view a value on the plot, click on the point of interest. To view a specific time of interest, highlight that section of the graph or use the zoom functionality.

1 Week Horizon WIS

4 Week Horizon WIS

In this figure, the dotted black line represents the average 1 week ahead error. There is larger variation in error for the 4 week horizon compared to the 1 week horizon.

1 Week Horizon 95% Coverage

The black line represents 95% coverage.

4 Week Horizon 95% Coverage

The black line represents 95% coverage.

Truth data

This plot shows the observed number of incident deaths over the evaluation period.